Introduction

Purpose

This project began in June 2022 as an opportunity to learn web scraping with R in the RStudio (now Posit) IDE. The project quickly grew as I found opportunities to learn and practice mapping, time series machine learning predictions, and much more: data wrangling, plotting and visualization, statistical tests, geocoding, and additional web scraping with Python when I hit a roadblock with an R package and a finicky web page.

Despite the heavy nature of the material, this has been a fun and interesting project. I have learned so much from the R community that I hope to begin repaying that debt with this project. I think some aspect of what is contained here will make some analyst's or data scientist's life easier; that has been my experience with R and the R community: free and open software and generous, widespread sharing of knowledge.

The beauty of R is that I can share this analysis with everyone, and, if anyone is skeptical or curious (as people should be), they can get the same data and challenge my findings and conclusions with their own analysis. I invite you to do so if you are so inclined. Please feel free to contact me if you would like to chat; I welcome all constructive criticism.

Preface

What follows is an analysis of mass shooting phenomena in the US.

Guns. Gun rights. Gun violence. These are highly contentious and controversial issues that carry a host of strong influences on individual opinions and beliefs.

The intent of this analysis is purely scientific and it has neither political motive nor personal agenda other than to inform people about the state of mass shooting gun violence in the United States of America.

An open and scientific mind was applied to these data both as a learning experience for the author and an informative experience for the public.

Overview

Gun violence is a pervasive issue in American culture. This analysis focuses on mass shootings in particular, and it does not include all types of US gun-related incidents.

The data used in this project come from the Gun Violence Archive (GVA); the original sources for each shooting are news media and law enforcement reports. The GVA can be found at https://www.gunviolencearchive.org. Special thanks are due to the GVA for their effort in recording and documenting these data and sharing them with the public.

A mass shooting is defined by the Gun Violence Archive as a shooting in which four or more people are injured or killed, not including the perpetrator(s) of the shooting.
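As a minimal illustration of this threshold (a sketch of my own; the function name is hypothetical, but the victim-count column names match those that appear in the scraped data):

```r
# GVA definition: four or more victims shot (injured or killed),
# not counting the perpetrator(s)
meets_gva_definition <- function(number_killed, number_injured) {
    (number_killed + number_injured) >= 4
}

meets_gva_definition(number_killed = 1, number_injured = 3)  # TRUE
meets_gva_definition(number_killed = 0, number_injured = 2)  # FALSE
```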

I approached this project in three frames: exploratory data analysis, mapping, and time series forecasting.

Setup

Load Libraries

We begin by loading in libraries that help with the data handling and analytical process.

library(tidyverse)
library(tidymodels)
library(tidytext)
library(lubridate)
library(scales)
library(janitor)
library(skimr)
library(tidyquant)
library(broom)
library(purrr)
library(rvest)
library(tidycensus)
library(knitr)
library(leaflet)
library(usmap)
library(reticulate)
library(modeltime)
library(tictoc)
library(future)
library(doFuture)
library(timetk)
library(thief)

registerDoFuture()

n_cores <- parallel::detectCores()

plan(strategy = cluster,
     workers  = parallel::makeCluster(n_cores))

api_key <- Sys.getenv("CENSUS_API_KEY")

Sourcing Data from the GVA

To get the large volume of data from the GVA website, we can use web scraping to read the HTML and turn it into a tidy tabular format for analysis. I originally began this project as a way to learn web scraping, and it evolved into a desire to learn even more as I explored the data.

The GVA URLs differ by year, and they even changed during the course of this project. I decided to break the scraping up by year, since that is roughly how the website is organized.

#2014

# create a tibble with a column of the website page names that we want to scrape
ms_pages_2014 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shootings/2014",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=",c(1:10))))

# use the purrr package to iterate over the column of URLs using the function from the xml2 package
ms_scrape_2014 <- ms_pages_2014 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

# convert the html into a table, again using purrr's map function to iterate over the scraped data
ms_2014_tbl <- map_df(ms_scrape_2014$scrape, ~ html_table(.x))

# repeat for additional URLs
#2015
ms_pages_2015 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shootings/2015",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shootings/2015?page=",c(1:13))))

ms_scrape_2015 <- ms_pages_2015 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2015_tbl <- map_df(ms_scrape_2015$scrape, ~ html_table(.x))


#2016
ms_pages_2016 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2016",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=",c(1:15),"&year=2016")))

ms_scrape_2016 <- ms_pages_2016 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2016_tbl <- map_df(ms_scrape_2016$scrape, ~ html_table(.x))


#2017
ms_pages_2017 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2017",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=",c(1:13),"&year=2017")))

ms_scrape_2017 <- ms_pages_2017 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2017_tbl <- map_df(ms_scrape_2017$scrape, ~ html_table(.x))


#2018
ms_pages_2018 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2018",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=",c(1:13),"&year=2018")))

ms_scrape_2018 <- ms_pages_2018 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2018_tbl <- map_df(ms_scrape_2018$scrape, ~ html_table(.x))


#2019
ms_pages_2019 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2019",
                                  str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=",c(1:16),"&year=2019")))

ms_scrape_2019 <- ms_pages_2019 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2019_tbl <- map_df(ms_scrape_2019$scrape, ~ html_table(.x))


#2019 through 2022
ms_pages_2019_2022 <- 
    tibble(pages = c("https://www.gunviolencearchive.org/mass-shooting",
                                       str_c("https://www.gunviolencearchive.org/mass-shooting?page=",c(1:79))))


ms_scrape_2019_2022 <- ms_pages_2019_2022 %>%
    mutate(scrape = map(.x = pages, .f = ~read_html(.x)))

ms_2019_2022_tbl <- map_df(ms_scrape_2019_2022$scrape, ~ html_table(.x))
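The repeated URL construction above could also be factored into a small helper. Here is a base-R sketch (the helper name is my own); the paged URL is passed as an `sprintf()` template because the GVA's query layout changed between years:

```r
# Hypothetical helper: build the first page plus the numbered pages for a year
gva_page_urls <- function(first_page, paged_template, n_pages) {
    c(first_page, sprintf(paged_template, seq_len(n_pages)))
}

# 2014-style URLs (11 in total)
urls_2014 <- gva_page_urls(
    "https://www.gunviolencearchive.org/reports/mass-shootings/2014",
    "https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=%d",
    n_pages = 10)
```

These URL vectors could then feed the same `read_html()`/`html_table()` pipeline used in each block above.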

Data Preparation

Once the different years have been collected, we can stack the scraped datasets and then store the file locally so that we don't need to call the website repeatedly.

complete_tbl <- ms_2014_tbl %>%
    bind_rows(ms_2015_tbl) %>%
    bind_rows(ms_2016_tbl) %>%
    bind_rows(ms_2017_tbl) %>%
    bind_rows(ms_2018_tbl) %>%
    bind_rows(ms_2019_tbl) %>%
    bind_rows(ms_2019_2022_tbl) %>%
    distinct()

write_rds(complete_tbl, "gva_data.rds")

Now we can read in the saved data file locally.

complete_tbl <- read_rds("gva_data.rds")
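To make the caching explicit, a small read-or-build helper could guard the scrape so it only runs when no local copy exists (a sketch; the function name is my own, using base R's `saveRDS()`/`readRDS()`):

```r
# Hypothetical helper: rebuild the data only when no cached copy exists
read_or_build <- function(path, build_fn) {
    if (!file.exists(path)) {
        saveRDS(build_fn(), path)  # run the expensive step once and cache it
    }
    readRDS(path)
}

# e.g. complete_tbl <- read_or_build("gva_data.rds", function() {
#     # run the scraping and bind_rows() steps above and return the result
# })
```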

Here I prepare the primary data set to include some date-related fields for convenience when plotting and summarizing the data.

ms_tbl <- complete_tbl %>% 
    janitor::clean_names() %>% 
    select(-operations) %>% 
    mutate(incident_date = mdy(incident_date),
           year = year(incident_date),
           month = month(incident_date, label = T, abbr = T),
           incident_month = rollforward(incident_date),
           dow = wday(incident_date, label = TRUE)) %>% 
    arrange(incident_date)

Grouping and Summarizing Function

This is a function to group and summarize the data in different ways, since we will need to do this many times later. Functions are very helpful in R, particularly when you need to do the same thing over and over.

group_function <- function(df, ...) {
    
    output <- df %>% 
        group_by(...) %>% 
        summarize(n_incidents = n(),
                  n_deaths = sum(number_killed, na.rm = T),
                  n_injuries = sum(number_injured, na.rm = T),
                  .groups = "drop")
    
    return(output)
}

The Data

First Look

This is an initial look at a sample of ten rows of the data. Each row represents one mass shooting incident. Each incident has its own unique ID, a date, a geographic location, and statistics about the number of injuries and deaths, as well as the date-related fields I added above to assist with analysis.

ms_tbl %>% 
    sample_n(size = 10)

Below is a broad summary of the data. No data are missing. There are some notable points regarding the day of the week and perhaps the month. Per incident, deaths range from zero to 59 and injuries from zero to 441.

skim(ms_tbl)
Data summary
Name ms_tbl
Number of rows 3992
Number of columns 11
_______________________
Column type frequency:
character 3
Date 2
factor 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 49 0
city_or_county 0 1 3 39 0 1032 0
address 0 1 3 54 0 3941 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
incident_date 0 1 2014-01-01 2022-11-13 2019-10-04 1983
incident_month 0 1 2014-01-31 2022-11-30 2019-10-31 107

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 TRUE 12 Jul: 514, Jun: 482, Aug: 429, Sep: 383
dow 0 1 TRUE 7 Sun: 1126, Sat: 951, Fri: 453, Mon: 395

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
incident_id 0 1 1400643.63 715673.90 92194 745126.8 1520523 2041314 2459259 ▅▅▅▆▇
number_killed 0 1 1.05 1.98 0 0.0 1 1 59 ▇▁▁▁▁
number_injured 0 1 4.18 7.29 0 3.0 4 5 441 ▇▁▁▁▁
year 0 1 2018.72 2.54 2014 2017.0 2019 2021 2022 ▃▅▂▆▇

Initial Summaries

  • Since 2014, there have been 3,992 mass shootings, 4,203 people have died, and there have been 16,687 injuries.
  • There is about one death per mass shooting on average.
  • There are about four injuries per mass shooting event on average.
  • There are four times more injuries than deaths (4:1 per incident on average).
ms_tbl %>% 
    group_function() %>% 
    mutate(inj_death_ratio = n_injuries / n_deaths,
           death_inc_ratio = n_deaths / n_incidents,
           inj_inc_ratio = n_injuries / n_incidents)

Plots and Visualizations

Mass Shooting Volume and Trend

Here we see a couple of patterns. First, the volume of mass shootings is increasing. Second, there appears to be a seasonal component to when mass shootings occur meaning there is a tendency and pattern to mass shooting volume consistent with the month of the year.

ms_tbl %>% 
    filter(incident_month < floor_date(Sys.Date(), "month")) %>% 
    group_function(incident_month) %>% 
    ggplot(aes(incident_month, n_incidents)) +
    geom_col(fill = "steelblue", color = "gray30", alpha = 0.9) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
    theme_tq() +
    scale_color_tq() +
    labs(title = "Number of Mass Shooting Incidents by Month and Year",
         subtitle = str_glue("From {min(ms_tbl$incident_date)} through {max(ms_tbl %>% filter(incident_month < floor_date(Sys.Date(), 'month')) %>% pull(incident_date))}"),
         x = "Month",
         y = "",
         caption = "current incomplete month removed")

Annual Growth Rate Table

How much have mass shooting incidents, deaths, and injuries been changing over time? To find out, we can compute the annual rates of change, shown below as the percentage increase or decrease from the prior year; later we will take their geometric mean for an overall growth rate.
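As a quick illustration of the geometric mean of growth rates, with made-up numbers (base R only):

```r
# Three hypothetical year-over-year rates: +20%, -10%, +15%
rates <- c(0.20, -0.10, 0.15)

# Average on the growth-factor (log) scale, then convert back to a rate
geo_mean_rate <- exp(mean(log(1 + rates))) - 1
round(geo_mean_rate, 4)  # about 0.0749, i.e. ~7.5% per year

# The arithmetic mean, mean(rates), would give ~8.3% and overstate
# the compounded growth
```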

growth_rates <- ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(year) %>% 
    mutate(across(c(n_incidents:n_injuries), ~. / lag(.), .names = "{.col}_delta")) %>% 
    drop_na()

gr_table <- growth_rates %>% 
    select(year, contains("delta")) %>% 
    mutate(across(contains("delta"), ~. - 1))

gr_table %>% 
    mutate(across(contains("delta"), ~percent(., accuracy = 0.1))) %>% 
    gt::gt()
year n_incidents_delta n_deaths_delta n_injuries_delta
2015 23.1% 34.2% 23.1%
2016 14.0% 22.8% 15.1%
2017 -9.1% -2.6% 17.4%
2018 -3.4% -15.6% -26.4%
2019 24.1% 25.0% 28.7%
2020 46.3% 10.3% 48.4%
2021 13.1% 37.2% 11.4%

Annual Growth Rate Graphic

This chart shows how the volume of mass shootings and the resulting deaths and injuries have either increased or decreased over the years. Anything above zero is an increase from the prior year, and anything below zero is a decrease.

Only two years (2017 and 2018) saw a modest decrease in the number of mass shootings and every other year has seen growth. Deaths decreased modestly in two years (2017 and 2018), but increased in all other years, and injuries only saw one year of decrease (2018).

gr_table %>% 
    pivot_longer(cols = contains("delta")) %>% 
    ggplot(aes(year, value, color = name)) +
    geom_segment(aes(x = year, xend = year, y = 0, yend = value)) +
    geom_point(size = 5) +
    geom_text(aes(label = percent(value, accuracy = 0.1)), size = 3, color = "black", vjust = -1.2) +
    geom_hline(yintercept = 0, color = "gray30") +
    expand_limits(x = c(2014.5, 2021.5), y = c(-0.25, 0.5)) +
    scale_y_continuous(labels = label_percent(accuracy = 1)) +
    theme_tq() +
    scale_color_tq() +
    facet_grid(~fct_inorder(name)) +
    labs(title = "Annual Percentage Change in Volume of Mass Shootings, Deaths, and Injuries",
         x = "",
         y = "% Change From Prior Year",
         caption = "current incomplete year removed") +
    theme(legend.position = "none",
          plot.title = element_text(size = 12))

Total Growth Rate

This is the average growth rate over the entire period. Effectively, you can read this summary as the average annual increase in the number of incidents, deaths, and injuries since 2014. On average, mass shooting incidents, deaths, and injuries have all been growing at roughly 14% per year since 2014.

gr_table %>% 
    mutate("Start Year" = min(year)-1,
           "End Year" = max(year)) %>% 
    group_by(`Start Year`, `End Year`) %>% 
    summarize(across(contains("delta"), ~exp(mean(log(. + 1))) - 1),
              .groups = "drop") %>% 
    mutate(across(contains("delta"), ~percent(., accuracy = 0.1))) %>% 
    pivot_longer(cols = contains("delta"), names_to = "Type", values_to = "Growth Rate") %>% 
    gt::gt()
Start Year End Year Type Growth Rate
2014 2021 n_incidents_delta 14.2%
2014 2021 n_deaths_delta 14.4%
2014 2021 n_injuries_delta 14.7%

Incidents Annual Volume

How many mass shootings do we see each year?

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(year) %>% 
    ggplot(aes(year, n_incidents, fill = n_incidents)) +
    geom_col(alpha = 0.6) +
    geom_text(aes(label = n_incidents), nudge_y = -20) +
    theme_tq() +
    scale_fill_gradient(high = "red", low = "blue") +
    scale_x_continuous(breaks = seq(min(ms_tbl$year), max(ms_tbl$year), by = 1)) +
    labs(title = "Number of Mass Shootings by Year",
         subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
         x = "Year",
         y = "",
         caption = "current incomplete year removed") +
    theme(legend.position = "none")

Incidents by Month

Which months tend to have more or fewer mass shootings?

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(month) %>% 
    ggplot(aes(month, n_incidents, fill = n_incidents)) +
    geom_col(alpha = 0.6) +
    geom_text(aes(label = n_incidents), nudge_y = 15) +
    theme_tq() +
    scale_fill_gradient(high = "red", low = "blue") +
    labs(title = "Number of Mass Shootings by Month (Year Over Year)",
         subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
         x = "",
         y = "",
         caption = "current incomplete year removed") +
    theme(legend.position = "none")

Seasonal Statistics

A follow-up question: what proportion of shootings happens in each season (seasons roughly approximated by their starting months)?

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(incident_id, month) %>% 
    select(month:n_incidents) %>% 
    summarize(n = n(),
              winter = sum(n_incidents[month %in% c("Dec","Jan","Feb")]),
              spring = sum(n_incidents[month %in% c("Mar","Apr","May")]),
              summer = sum(n_incidents[month %in% c("Jun","Jul","Aug")]),
              fall =   sum(n_incidents[month %in% c("Sep","Oct","Nov")]),
              .groups = "drop") %>% 
    mutate(spring_prop = spring / n,
           summer_prop = summer / n,
           fall_prop = fall / n,
           winter_prop = winter / n) %>% 
    select(contains("prop")) %>% 
    mutate(across(everything(), ~ percent(., accuracy = 0.1)))

Mass shootings are approximately 10% more common in the summer months and approximately 10% less common in the winter when compared to spring and fall months.

Incidents by Weekday

Which days of the week tend to have more or fewer mass shootings?

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(dow) %>% 
    ggplot(aes(dow, n_incidents, fill = n_incidents)) +
    geom_col(alpha = 0.6) +
    geom_text(aes(label = n_incidents), nudge_y = 30) +
    theme_tq() +
    scale_fill_gradient(high = "red", low = "blue") +
    labs(title = "Number of Mass Shootings by Weekday (Year Over Year)",
         subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
         x = "",
         y = "",
         caption = "current incomplete year removed") +
    theme(legend.position = "none")

Weekend Proportion

A follow-up question: what proportion of mass shootings occurs on weekends?

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(incident_id, dow) %>% 
    select(dow:n_incidents) %>% 
    summarize(n = n(),
              sat_sun = sum(n_incidents[dow %in% c("Sat","Sun")]),
              .groups = "drop") %>% 
    transmute(weekend_prop = sat_sun / n,
              weekend_prop = percent(weekend_prop, accuracy = 0.1))

About 51% of mass shootings occur on weekends.
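For context, if incidents were spread uniformly across the week, Saturday and Sunday would account for only 2/7 of them. A quick base-R comparison (the 0.51 figure is taken from the summary above):

```r
expected_weekend <- 2 / 7   # uniform-days baseline, about 28.6%
observed_weekend <- 0.51    # observed weekend share from the summary above

observed_weekend / expected_weekend  # weekends are roughly 1.8x over-represented
```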

Month and Weekday Proportions

Here is one more visualization to help us see the number of mass shootings by day of the week and the month in which they occur. Each row of bars represents 100% of the shootings in the given days and months.

ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(month, dow) %>% 
    select(month:n_incidents) %>% 
    ggplot(aes(fct_rev(month), n_incidents, fill = dow)) +
    geom_col(position = position_fill(reverse = TRUE), alpha = 0.8) +
    coord_flip() +
    scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
    theme_tq() +
    labs(title = "Proportion of Mass Shootings by Month and Weekday, Year Over Year",
         subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
         x = "",
         y = "",
         caption = "current incomplete year removed",
         fill = "Day of Week") + 
    guides(fill = guide_legend(nrow = 1, byrow = TRUE))

Statistical Tests

Relationships Between Variables

We have seen that some days have higher numbers of mass shootings than others, as do some months. That raises questions.

Chi-Squared Test - Weekday

Is there a significant difference in the number of mass shootings by day of the week? We can use a chi-squared goodness-of-fit test, which essentially tests the assumption that there is no difference in the proportion of mass shootings by day of the week. In other words, each day should have proportionally the same volume of mass shootings.

chi_dow_tbl <- ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(incident_id, dow) %>% 
    select(dow:n_incidents)

chi_dow_tbl %>% table()
##      n_incidents
## dow     1
##   Sun 948
##   Mon 342
##   Tue 283
##   Wed 326
##   Thu 312
##   Fri 382
##   Sat 800
chisq_test(chi_dow_tbl, 
           response = dow,
           p = c("Sun" = 1/7,
                 "Mon" = 1/7,
                 "Tue" = 1/7,
                 "Wed" = 1/7,
                 "Thu" = 1/7,
                 "Fri" = 1/7,
                 "Sat" = 1/7))

The answer here is yes: there is a significant difference in the volume of mass shooting incidents across days of the week at the p < 0.05 level.
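The same goodness-of-fit test can be reproduced in base R with `chisq.test()` on the counts printed above:

```r
# Incident counts by weekday, taken from the table above
dow_counts <- c(Sun = 948, Mon = 342, Tue = 283, Wed = 326,
                Thu = 312, Fri = 382, Sat = 800)

# Null hypothesis: each weekday accounts for 1/7 of incidents
chisq.test(dow_counts, p = rep(1 / 7, 7))
```

The p-value is far below 0.05, consistent with the result above.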

Chi-Squared Test - Month

Is there a significant difference between months in terms of the number of mass shooting incidents?

chi_month_tbl <- ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(incident_id, month) %>% 
    select(month:n_incidents)

chi_month_tbl %>% table()
##      n_incidents
## month   1
##   Jan 189
##   Feb 180
##   Mar 189
##   Apr 244
##   May 316
##   Jun 418
##   Jul 425
##   Aug 366
##   Sep 319
##   Oct 283
##   Nov 261
##   Dec 203
chisq_test(chi_month_tbl, 
           response = month,
           p = c("Jan" = 1/12,
                 "Feb" = 1/12,
                 "Mar" = 1/12,
                 "Apr" = 1/12,
                 "May" = 1/12,
                 "Jun" = 1/12,
                 "Jul" = 1/12,
                 "Aug" = 1/12,
                 "Sep" = 1/12,
                 "Oct" = 1/12,
                 "Nov" = 1/12,
                 "Dec" = 1/12))

Yes, there is a significant difference between some months at the p < 0.05 level.

Chi-Squared Test - Weekday and Month

Is there a statistically significant relationship between the day of the week when a mass shooting occurs and the month in which the shooting happens?

We can use a chi-squared test of independence to test our question.

chi_tbl <- ms_tbl %>% 
    filter(year(incident_date) < year(Sys.Date())) %>% 
    group_function(incident_id, month, dow) %>% 
    select(month:n_incidents)

chi_tbl %>% 
    table()
## , , n_incidents = 1
## 
##      dow
## month Sun Mon Tue Wed Thu Fri Sat
##   Jan  50  21  15  23  21  28  31
##   Feb  46  16  18  20  16  18  46
##   Mar  55  15  14  16  12  32  45
##   Apr  71  20  20  21  23  26  63
##   May  95  30  23  34  22  31  81
##   Jun 113  40  42  40  40  41 102
##   Jul 110  53  30  45  52  40  95
##   Aug 115  38  32  35  23  36  87
##   Sep  96  37  27  31  27  22  79
##   Oct  70  25  24  25  34  31  74
##   Nov  71  29  19  24  25  37  56
##   Dec  56  18  19  12  17  40  41
chisq_test(x = chi_tbl, formula = dow ~ month)

The answer is that we don’t have enough evidence to reject the null hypothesis, which asserts that the day of the week and the month of a mass shooting are independent. The Sundays in June, July, and August stand out with higher volumes, but the relationship between month and day of the week is not statistically significant at the p < 0.05 level.

This is interesting given that the day-of-week and month tests were each significant on their own, yet taken together there is no evidence that the two are related.

Seasonality and Trend Exploration

We noted earlier that there are seasonal trends in the data, so I am going to bring those forward using moving averages, which are simply the average volume of mass shootings over a given window of time. I will use two windows: a six-month moving average to highlight the seasonality, and a 12-month moving average to highlight the trend of increasing numbers of mass shootings. We appear to have a new post-pandemic normal.

ma_tbl <- ms_tbl %>% 
    filter(incident_date < floor_date(Sys.Date(), "month")) %>% 
    arrange(incident_date) %>% 
    group_function(incident_month)

ma_tbl %>% 
    ggplot(aes(incident_month, n_incidents)) +
    geom_col(fill = "steelblue", color = "gray30", alpha = 0.4) +
    geom_line(data = ma_tbl %>%
                select(1,2) %>% 
                mutate(ma_6 = rollmean(n_incidents, k = 6, fill = NA, align = "right")),
                aes(incident_month, ma_6), color = "#F44336", size = 1.5) +
    geom_line(data = ma_tbl %>%
                select(1,2) %>% 
                mutate(ma_12 = rollmean(n_incidents, k = 12, fill = NA, align = "right")),
                aes(incident_month, ma_12), color = "black", size = 1.5)+
    theme_tq() +
    scale_color_tq() +
    labs(title = "Mass Shooting Monthly Incident Count Moving Averages",
         subtitle = "Red line is the 6 month moving average, black line is the 12 month moving average",
         x = "",
         y = "Number of Mass Shooting Incidents",
         caption = "current incomplete month removed")

It is now easier to see the increasing trend in the 12-month moving average (the black line) and the seasonal monthly pattern in the six-month moving average (the red line). You can see how the red line moves above and below the black line, clearly revealing the seasonal component of mass shootings.

We can also see that since 2020 the variability in the number of mass shootings has increased, as the red line climbs much higher than in previous years. This follows from the increase in volume, but the change in overall system behavior is worth noting. Further, while 2020 was an aberrant year given the pandemic, and assuming a potential relationship between the pandemic's impact on US society and mass shooting behavior, prevalence has not returned to pre-pandemic levels even now that we are in a post-pandemic environment.

Incidents, Deaths, and Injuries by Month

This is a time series that takes incidents as we have seen previously and lays out deaths and injury volume by month.

ms_tbl %>% 
    filter(incident_month < floor_date(Sys.Date(), "month")) %>% 
    group_function(incident_month) %>% 
    rename("Incidents" = n_incidents,"Deaths" = n_deaths,"Injuries" = n_injuries) %>% 
    pivot_longer(cols = -1) %>% 
    mutate(name = factor(name, levels = c("Incidents","Injuries","Deaths"))) %>% 
    ggplot(aes(incident_month, value, group = name, color = name)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +
    scale_x_date(date_labels = "%Y", date_breaks = "3 years") +
    expand_limits(y = 0) +
    theme_tq() +
    scale_color_tq() +
    facet_wrap(~ name, scales = "free_y") +
    labs(title = "Number of Mass Shooting Incidents by Month, Injuries and Deaths",
         subtitle = str_glue("From {min(ms_tbl$incident_date)} through {max(ms_tbl %>% filter(incident_month < floor_date(Sys.Date(), 'month')) %>% pull(incident_date))}"),
         x = "",
         y = "") +
    theme(legend.position = "none")

US States

States with No Mass Shootings

Is there a state that has not had a mass shooting?

us_states <- tibble("state" = state.name) %>% 
    bind_cols(tibble("state_code" = state.abb)) %>% 
    bind_rows(tibble(state = "District of Columbia",
                     state_code = "DC")) %>% 
    arrange(state)

ms_states_tbl <- ms_tbl %>% 
    inner_join(us_states, by = "state") %>% 
    distinct(state_code)

no_mass_shooting_states <- tibble("state_code" = state.abb,
                                  "state_name" = state.name) %>% 
    anti_join(ms_states_tbl, by = "state_code") %>% 
    pull(state_name)

state_count <- length(no_mass_shooting_states)

Yes, 2 states have not had a mass shooting since data collection began in 2014: Hawaii and North Dakota.

States with the Most Mass Shootings

Let’s look at the number of mass shootings by state. This gives us a look at totals.

ms_tbl %>% 
    group_function(state) %>% 
    inner_join(us_states, by = "state") %>% 
    ggplot(aes(fct_reorder(state, n_incidents), n_incidents)) +
    geom_col(fill = "steelblue", color = "gray30", alpha = 0.8) +
    geom_text(aes(label = n_incidents), hjust = -0.1, size = 2.5) +
    scale_y_continuous(limits = c(0, 450), expand = c(0, 0)) +
    coord_flip() +
    theme_tq() +
    labs(title = "States Ordered by Number of Mass Shootings",
         subtitle = str_glue("{min(ms_tbl$incident_date)} through {max(ms_tbl$incident_date)}"),
         x = "",
         y = "Mass Shootings") +
    theme(axis.text.y = element_text(size = 6))

state_stats <- ms_tbl %>% 
    group_function(state) %>% 
    inner_join(us_states, by = "state") %>% 
    mutate(incidents_rank = min_rank(desc(n_incidents)))

top_five_states <- state_stats %>% 
    filter(incidents_rank %in% c(1:5)) %>% 
    arrange(incidents_rank) %>% 
    pull(state)

top_states_for_time_series <- state_stats %>% 
    filter(incidents_rank %in% c(1:4)) %>% 
    arrange(incidents_rank) %>% 
    pull(state)

The five states with the greatest number of mass shootings are: Illinois, California, Texas, Florida, and Pennsylvania.

We should consider how likely it is, proportionally, to encounter a mass shooting (or to be injured or killed in one) instead of just looking at totals, because state populations differ widely. In other words, given the population of a state, which states have the greatest number of mass shootings (or deaths or injuries) per person? I will use the 2020 US decennial census population by state and the number of incidents, injuries, and deaths by state in 2021. I am including Washington, D.C. as a state. Another approach would be to sum all of the years against the 2020 population, but I elected to use the most recent complete year on record.

Top Mass Shootings States Per Capita

Which ten states have the most mass shootings, deaths, or injuries per capita?

us_state_pop_tbl <- get_decennial(geography = "state", 
                       variables = "P1_001N", 
                       year = 2020)

per_capita_base <- 1e5

per_capita_tbl <- ms_tbl %>% 
    filter(year == 2021) %>% 
    group_function(state) %>% 
    group_by(state) %>% 
    inner_join(us_state_pop_tbl %>% 
                   select("state" = NAME, "population" = value), by = "state") %>% 
    mutate(incidents_per_capita = n_incidents / (population / per_capita_base),
           deaths_per_capita = n_deaths / (population / per_capita_base),
           injuries_per_capita = n_injuries / (population / per_capita_base))

per_capita_tbl %>% 
    select(state, contains("per_capita")) %>% 
    pivot_longer(cols = contains("per_capita")) %>% 
    mutate(name = fct_relevel(name, c("incidents_per_capita","deaths_per_capita","injuries_per_capita"))) %>% 
    group_by(name) %>% 
    slice_max(order_by = value, n = 10) %>% 
    ggplot(aes(reorder_within(x = state, by = value, within = name, fun = sum), value, fill = name)) +
    geom_col(alpha = 0.8) +
    geom_text(aes(label = comma(value, accuracy = 0.1)), size = 2, hjust = 1.1, color = "white") +
    coord_flip() +
    facet_wrap(~ name, scales = "free") +
    scale_x_reordered() +
    scale_y_continuous(label = label_comma(accuracy = 0.1)) +
    theme_tq() +
    scale_fill_tq() +
    labs(title = "Per Capita Mass Shootings, Deaths, and Injuries by US State in 2021",
         x = "",
         y = "Per 100,000 People") +
    theme(axis.text.y = element_text(size = 6),
          legend.position = "none")

US Cities

Top 12 Cities

Which cities tend to see the most mass shooting incidents since 2014? Are they changing over time?

Twelve seems like an odd number to choose, but it just happens to plot better.

top_cities <- ms_tbl %>% 
    filter(year < year(Sys.Date())) %>% 
    group_by(state, city_or_county) %>% 
    summarize(number_incidents = n(),
              number_killed = sum(number_killed),
              number_injured = sum(number_injured),
              .groups = "drop") %>% 
    slice_max(order_by = number_incidents, n = 12, with_ties = FALSE) %>% 
    distinct(state, city_or_county) %>% 
    transmute(city_list = str_c(city_or_county, state, sep = ", "))

top_cities_incidents <- ms_tbl %>% 
    mutate(city_list = str_c(city_or_county, state, sep = ", ")) %>% 
    filter(year < year(Sys.Date()), 
           city_list %in% top_cities$city_list) %>% 
    group_by(state, city_or_county, year) %>% 
    summarize(number_incidents = n(),
              number_killed = sum(number_killed),
              number_injured = sum(number_injured),
              .groups = "drop") %>% 
    complete(year, nesting(state, city_or_county), 
             fill = list(number_incidents = 0, number_killed = 0, number_injured = 0)) %>% 
    ungroup()

top_cities_incidents %>% 
    ggplot(aes(year, number_incidents, fill = city_or_county)) +
    geom_col(alpha = 0.9) +
    geom_text(aes(label = number_incidents), vjust = 1.5, size = 2, color = "white") +
    facet_wrap(~ fct_reorder(str_c(city_or_county, state, sep = ", "), -number_incidents, sum), scales = "free_y") +
    expand_limits(y = 0) +
    theme_tq() +
    scale_fill_tq() +
    scale_y_continuous(labels = label_comma(accuracy = 1)) +
    labs(x = "",
         y = "",
         title = str_glue("Top {nrow(top_cities)} US Cities by Total Number of Mass Shooting Incidents"),
         subtitle = "Cities are Sorted in Descending Order",
         caption = "current incomplete year removed") +
    theme(legend.position = "none",
          strip.text.x = element_text(size = 6),
          axis.text.x = element_text(size = 5),
          axis.text.y = element_text(size = 5))

US Cities with 20+ Mass Shootings

Which cities have had more than 20 mass shootings since 2014?

Selection Criteria - Why 20?

The number 20 is somewhat arbitrary, but when looking at the counts of incidents by city, 20 seems to be near an inflection point at which the prevalence of mass shootings in a given city becomes significantly higher. Ten would have also been a reasonable threshold to choose.

ranked_cities <- ms_tbl %>% 
    filter(year < year(Sys.Date())) %>% 
    inner_join(us_states, by = "state") %>% 
    group_by(city = str_c(city_or_county, state_code, sep = ", ")) %>% 
    summarize(number_incidents = n(),
              number_killed = sum(number_killed),
              number_injured = sum(number_injured),
              min_year = min(year),
              max_year = max(year),
              .groups = "drop") %>% 
    arrange(desc(number_incidents)) %>% 
    mutate(rank = row_number(),
           city = fct_reorder(city, number_incidents, max))

top_5_cities <- ranked_cities %>% 
    slice_min(order_by = rank, n = 5) %>% 
    pull(city)

ranked_cities %>% 
    ggplot(aes(rank, number_incidents)) +
    geom_point(alpha = 0.9, shape = 1, color = "gray40", size = 1) +
    geom_hline(yintercept = 20, color = "blue") +
    scale_x_continuous(breaks = seq(0,1000,50)) +
    scale_y_continuous(breaks = seq(0,300,20)) +
    theme_tq() +
    labs(x = "City Rank",
         y = "Number of Incidents",
         title = str_glue("US Cities Mass Shooting Incidents Counts by City Rank"),
         subtitle = str_glue("From {min(ranked_cities$min_year)} to {max(ranked_cities$max_year)}"))

Cities with the Most Mass Shootings

Here are those cities with 20 or more mass shootings since 2014.

ranked_cities %>% 
    filter(number_incidents >= 20) %>% 
    ggplot(aes(city, number_incidents, fill = number_incidents)) +
    geom_col(alpha = 0.9) +
    geom_text(aes(label = number_incidents), size = 2, color = "white", hjust = 1.2) +
    coord_flip() +
    scale_fill_gradient(low = "blue2", high = "red4") +
    scale_y_continuous(expand = c(0,0)) +
    theme_tq() +
    labs(x = "",
         y = "",
         title = str_glue("US Cities with 20 or More Mass Shooting Incidents"),
         subtitle = str_glue("From {min(ranked_cities$min_year)} to {max(ranked_cities$max_year)}")) +
    theme(legend.position = "none",
          axis.text.x = element_text(size = 6),
          axis.text.y = element_text(size = 6))

Mapping

The initial reason I started this project was to work on a web scraping endeavor. Scraping the initial mass shooting data was very straightforward, and I learned a great deal about how to accomplish that process in R.

However, as things progressed and I wanted to mine more of the incident data, I discovered that some pages of the GVA website have a bot-detection mechanism in place to verify that a browser is being driven by a human and not a bot. There were several helpful suggestions online about using Selenium as a headless browser, which I tried with the RSelenium package. The website still recognized that I was using a controlled browser, and it denied my access.

Eventually, I discovered that I needed to adjust some of the Chrome options, and I was not able to make that happen in R. Based on the many solutions I found online, I knew I needed to switch to Python. I did just that, and it was a solid learning experience, as I have been working on enhancing my Python skills and this was a practical avenue for that objective.

The following took a few weeks of research and a lot of late nights and weekends. In all, it was worth it, but I did question pursuing this at times given the level of effort just to scrape these pages. I really wanted the best latitude and longitude information, and an earlier attempt to get coordinates from geocoded addresses was not reliable because many of the addresses are missing detail. GVA seems to have better coordinate data behind the scenes.

I started by pulling the incident IDs from the GVA data, which then become part of the URL for each incident-specific page. This was from the original data pull for all incidents; after gathering all of the historical data, I only need to update with new incidents moving forward.

incident_ids <- ms_tbl %>%
    distinct(incident_id) # get a list of all of the incident IDs

geo_locations_incidents_tbl <- incident_ids %>%
    mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}"))

Here we only need to identify the new incidents and then look up their geolocations.

ms_tbl_updated_geo_locs <- read_rds("ms_tbl_updated_geo_locs.rds")

new_incident_ids <- ms_tbl %>% 
    anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>% 
    distinct(incident_id)

new_geo_locations_incidents_tbl <- new_incident_ids %>%
    mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}"))

Webscraping with Python

Now we can scrape using Python, which gives us more flexibility than RSelenium did (at least as far as I was able to discern).

# run these in terminal
# pip install selenium
# pip install beautifulsoup4
# note: Selenium 4.6+ downloads and manages the chromedriver binary automatically;
# with older versions, install chromedriver separately and add it to your PATH

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd

options = Options() # these are the options needed to hide that the browser was being controlled by a computer and not a human
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--profile-directory=Default")
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options) # this establishes the browser
first_page = "https://www.gunviolencearchive.org/incident/882422" # this is a default page to start with
driver.get(first_page) # this opens the URL in the established browser

# from here, I had to click on a second tab manually, and open another page on the same website which cleared the bot checker I was running into

# then, I was able to run the loop as follows

url_list = r.new_geo_locations_incidents_tbl # creates a dataframe in python from the R dataframe created earlier
temp_list = {
    "geo_data": []
} # this is an empty list container that will hold the HTML text as we iterate over the pages.

for url in url_list["pages"]: # loop to open each page, parse the HTML, pull the text, append it to the list, rinse, and repeat
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    geo_loc = soup.get_text()
    temp_list["geo_data"].append(geo_loc)
    
df = pd.DataFrame.from_dict(temp_list) # this moves the list into a dataframe

driver.close() # this closes the session.

# note that I was not able to iterate over 4,000 pages in one go, as the bot detector blocked my VPN IPs a few times
# breaking the work into chunks of 200 to 300 pages at a time seemed to do the trick
# I probably could have used a sleep command to slow the roll, which might have appeased the bot police....
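The chunk-and-pause idea mentioned above could be formalized like this. This is a hypothetical helper, not part of the original script, and `fetch` stands in for whatever page-retrieval function you use (e.g. one wrapping `driver.get`):

```python
from time import sleep

def fetch_in_chunks(urls, fetch, chunk_size=250, pause_seconds=60):
    """Fetch URLs in small batches, pausing between batches to avoid
    tripping rate limits or bot detection."""
    results = []
    for start in range(0, len(urls), chunk_size):
        for url in urls[start:start + chunk_size]:
            results.append(fetch(url))
        if start + chunk_size < len(urls):  # no need to pause after the last batch
            sleep(pause_seconds)
    return results
```

With something like this in place, the manual 200-to-300-page batching becomes a single call.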

Pull the Python object into R, and back up the scraped data in case of R crash en route.

geo_data <- py$df # this command moves a Python object into an R object...Reticulate is pretty sweet!

write_rds(geo_data, "scraped_geo_data.rds") # I stored the data as I went in case R crashed

Update the geolocations file with the new records and save it.

geo_data <- read_rds("scraped_geo_data.rds")

output <- geo_data %>% 
    transmute(geo_data = str_extract(geo_data, "Geolocation.*")) # pull the geo location data

ms_tbl_updated <- ms_tbl %>% 
    anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>% 
    bind_cols(output) # combine the original data with the new geo location data 

tbl_to_update <- read_rds("ms_tbl_updated_geo_locs.rds") # use this to source a previously stored table and append it

ms_tbl_updated <- tbl_to_update %>% 
    bind_rows(ms_tbl_updated) # append the new data rows to the old

write_rds(ms_tbl_updated, "ms_tbl_updated_geo_locs.rds") # save it

ms_tbl_updated %>% 
    summarize(n = n(),
              n_dist = n_distinct(incident_id)) # check it

Prepare the Map Data

ms_tbl_map_data <- read_rds("ms_tbl_updated_geo_locs.rds")

ms_tbl_map_prep <- ms_tbl_map_data %>% 
    rename("GeoLocation" = geo_data) %>% 
    mutate(temp = GeoLocation,
           temp = str_remove_all(temp, "Geolocation: ")) %>% 
    separate(temp, into = c("latitude","longitude"), sep = ", ") %>% 
    mutate(across(latitude:longitude, ~ as.numeric(.)),
           color_year = factor(year),
           size = number_killed + number_injured,
           date_formatted = format.Date(incident_date, format = "%b %d, %Y"),
           label = str_glue("{city_or_county}, {state}
                             {date_formatted}
                             Injured: {number_injured} | Killed: {number_killed}"))
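The string handling in the `mutate()`/`separate()` steps above boils down to pulling two numbers out of the text that follows the "Geolocation:" label. An equivalent sketch in Python, using hypothetical sample text:

```python
import re

def parse_geolocation(page_text):
    """Extract (latitude, longitude) from scraped page text containing
    a line like 'Geolocation: 41.88, -87.63'. Returns None if absent."""
    match = re.search(r"Geolocation:\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)", page_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

parse_geolocation("...\nGeolocation: 41.88, -87.63\n...")
```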

Mapping Incidents with Leaflet

Leaflet creates an interactive map that lets you zoom in and out and also mouse over incidents to get more information.

ms_tbl_map_prep %>% 
    leaflet(height = 1000, width = 1000) %>%
    addProviderTiles("Stamen.Toner") %>% 
    addProviderTiles("Esri", group = "Esri") %>%
    addCircleMarkers(label = lapply(str_glue("{month(ms_tbl_map_prep$incident_date, label = TRUE, abbr = TRUE)} {year(ms_tbl_map_prep$incident_date)} <br/> {ms_tbl_map_prep$city_or_county}, {ms_tbl_map_prep$state} <br/> {ms_tbl_map_prep$number_killed} killed | {ms_tbl_map_prep$number_injured} wounded"),
                                    htmltools::HTML), radius = 3, color = "blue", clusterOptions = markerClusterOptions()) %>%
    setView(lng = -90, lat = 20, zoom = 3.25) %>%
    addLayersControl(baseGroups = c("Stamen.Toner","Esri"))

Map of the 10 worst mass shootings

worst_shootings <- ms_tbl_map_prep %>% 
    slice_max(order_by = size, n = 10, with_ties = FALSE)

worst_transformed <- usmap_transform(worst_shootings, input_names = c("longitude", "latitude"), output_names = c("x", "y"))

set.seed(1984)
plot_usmap(fill = "goldenrod1", alpha = 0.8) +
    geom_point(data = worst_transformed, aes(x = x, y = y, size = size, color = color_year)) +
    ggrepel::geom_label_repel(data = worst_transformed, 
                              aes(x = x, y = y, label = label), 
                              size = 2,
                              color = "black", 
                              min.segment.length = 0.1, 
                              segment.colour = "blue3", 
                              box.padding = 1,
                              arrow = arrow(length = unit(0.01, "npc"))) +
    scale_color_tq() +
    labs(title = str_glue("Map of the Ten Worst Mass Shooting Incidents Between {year(min(ms_tbl$incident_date))} and {year(max(ms_tbl$incident_date))}"),
         subtitle = "Point size represents the total deaths and injuries per incident") +
    theme(legend.position = "none")

Time Series Modeling

Time series modeling is a form of machine learning that uses patterns in historical data to predict what might happen next. I will use data from some of the highest mass shooting states to create predictive time series models. Perhaps leaders from these states can build on such models to anticipate when things might get worse and deploy community resources and programs to try to reduce the number of mass shootings.

Prepare and Nest Time Series Data

I am planning on projecting out 12 months, and the following code sets up the data so that we can iterate over it with several different models and then pick the best one for the forecasts.

One of the coolest features of R is that you can create data tables within data tables, meaning that a column which typically contains a single value per record can instead contain an entire data set. These are called nested data sets. Under the hood, data frames and tibbles are lists of columns, so a nested data set is really a list within a list. Nesting is super useful for keeping data tidy and particularly for iterative modeling of many kinds.

A convenient aspect of the modeltime package is that it automatically holds out testing data for you, so you don’t have to create separate training and testing sets yourself (though that is not complicated to do). This is especially nice with time series data because the testing data is the most recent data, as opposed to non-time-series modeling where a random subset of the data is withheld during training.
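For readers coming from Python: pandas has no direct equivalent of R's nesting, but the core idea, a container whose elements are entire data sets keyed by group, can be sketched with a dict of DataFrames. This is a hypothetical illustration with made-up data, not part of the modeltime workflow:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Texas"],
    "incident_month": ["2021-01", "2021-02", "2021-01"],
    "n_incidents": [3, 5, 2],
})

# one entry per state; each value is a whole DataFrame for that group,
# analogous to a nested tibble column in R
nested = {state: group.reset_index(drop=True)
          for state, group in df.groupby("state")}
```

Each group's data can then be modeled independently by iterating over the container, which is exactly what modeltime's nested fitting does for us in R.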

future_date <- 12

ms_model_state_data <- ms_tbl %>%
    filter(incident_date < floor_date(Sys.Date(), "month"),
           year(incident_date) >= 2020) %>%
    group_function(state, incident_month) %>%
    select(-n_injuries, -n_deaths) %>%
    complete(state, incident_month, fill = list(n_incidents = 0)) %>%
    filter(state %in% top_states_for_time_series)

nested_data_tbl <- ms_model_state_data %>% 
    group_by(state) %>%
    modeltime::extend_timeseries(.id_var = state,
                                 .date_var = incident_month,
                                 .length_future = future_date) %>%
    modeltime::nest_timeseries(.id_var = state,
                    .length_future = future_date) %>%
    modeltime::split_nested_timeseries(.length_test = future_date)

Prepare Recipes and Model Workflows

This code sets up the model recipes and creates workflows that use the recipes and their instructions. I did not use any hyperparameter tuning in this project, but I would recommend that for a more in depth application.

# xgboost recipe
rec_xgb <- recipe(n_incidents ~ ., extract_nested_train_split(nested_data_tbl)) %>%
    step_timeseries_signature(incident_month) %>%
    step_rm(incident_month) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

# arima recipe
rec_arima <- recipe(n_incidents ~ ., extract_nested_train_split(nested_data_tbl)) %>%
    step_timeseries_signature(incident_month) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

# xgboost models

wflw_xgb_1 <- workflow() %>%
    add_model(boost_tree("regression", learn_rate = 0.35) %>% set_engine("xgboost")) %>%
    add_recipe(rec_xgb)

wflw_xgb_2 <- workflow() %>%
    add_model(boost_tree("regression", learn_rate = 0.50) %>% set_engine("xgboost")) %>%
    add_recipe(rec_xgb)

# random forest model
wflw_rf_1 <- workflow() %>%
    add_model(rand_forest("regression") %>% set_engine("ranger")) %>%
    add_recipe(rec_xgb)

# arima model
wflw_arima_1 <- workflow() %>%
    add_model(arima_boost("regression") %>% set_engine("arima_xgboost")) %>%
    add_recipe(rec_arima)

# prophet model
wflw_prophet_1 <- workflow() %>%
    add_model(prophet_boost("regression") %>% set_engine("prophet_xgboost")) %>%
    add_recipe(rec_arima)

Run models

The following section fits the models to the data.

parallel_start(16)

nested_modeltime_tbl <- nested_data_tbl %>%
    modeltime_nested_fit(
        model_list = list(
            wflw_xgb_1,
            wflw_xgb_2,
            wflw_rf_1,
            wflw_arima_1,
            wflw_prophet_1),
        control = control_nested_fit(
            verbose   = TRUE,
            allow_par = TRUE))

write_rds(nested_modeltime_tbl, "nested_modeltime_tbl.rds")

Review Model Forecasts

Here is a look at the various models and how they did in their predictive capabilities compared to the held out actual data. Earlier, I elected to choose the top 4 states for this time series analysis simply due to plotting space. In reality, and this is one of the reasons that the modeltime package is so powerful, you can run time series models across any number of states (or whatever grouping variable you are interested in). This makes light work of the process of testing multiple models across multiple states.

nested_modeltime_tbl <- read_rds("nested_modeltime_tbl.rds")

nested_modeltime_tbl %>%
    extract_nested_test_forecast() %>%
    group_by(state) %>%
    plot_modeltime_forecast(.facet_ncol = 2)

Choose the best models

We select the most accurate models based upon the mean absolute error (MAE) metric. MAE measures the average magnitude of a model’s prediction errors, and we select the models with the lowest error.
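For intuition, MAE is simply the average of the absolute differences between predictions and actuals. A minimal Python illustration with made-up monthly counts:

```python
def mean_absolute_error(actual, predicted):
    """Average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# hypothetical monthly incident counts vs. model predictions
mean_absolute_error([3, 5, 2], [4, 5, 0])  # (1 + 0 + 2) / 3 = 1.0
```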

nested_best_tbl <- nested_modeltime_tbl %>%
    modeltime_nested_select_best(metric = "mae", minimize = TRUE)

write_rds(nested_best_tbl, "best_models.rds")

Visualize Best Models

Here is a final look at the best model’s predictions compared with the actual data that we held back.

nested_best_tbl <- read_rds("best_models.rds")

nested_best_tbl %>%
    extract_nested_test_forecast() %>%
    group_by(state) %>%
    plot_modeltime_forecast(.facet_ncol = 2)

Final Model Refits

This is the final step in the modeling process where we fit the most accurate models to the data including the data that we held back.

nested_best_refit_tbl <- nested_best_tbl %>%
    modeltime_nested_refit(
        control = control_refit(
            verbose   = TRUE,
            allow_par = TRUE))

parallel_stop()

write_rds(nested_best_refit_tbl, "nested_best_refit_tbl.rds")

Visualize the Forecasts

Here are the final future forecasts by state. The line represents the expected value of the forecast, and the gray area represents the 95% confidence interval around the prediction. No forecast will be exactly correct, so the confidence interval gives us a reasonable expectation of where the true value is likely to fall.

nested_best_refit_tbl <- read_rds("nested_best_refit_tbl.rds")

nested_best_refit_tbl %>%
    extract_nested_future_forecast() %>%
    mutate(.conf_lo = case_when(.conf_lo < 0 ~ 0,
                                TRUE ~ .conf_lo)) %>% 
    group_by(state) %>%
    plot_modeltime_forecast(.facet_ncol = 2)

Summary

Mass shooting incidents have increased since 2014 at an average annual rate of 14%. Since 2020, the number of mass shootings has risen further, and we appear to have a new, higher ‘normal’ level of mass shootings in the US.

Saturdays and Sundays account for about half of all mass shootings, while Monday through Friday account for the remaining half.

Summer months account for about 35% of mass shootings and winter months about 15%, leaving fall and spring similar at approximately 25% each.

Only two of the fifty states have not had a mass shooting since 2014: Hawaii and North Dakota.

The top five states (including Washington, DC) with the most total mass shootings since 2014 are Illinois, California, Texas, Florida, and Pennsylvania.

Per capita, Washington, D.C. experiences the most mass shootings, injuries, and deaths. On a per-person basis, you were roughly three times more likely to experience a mass shooting in D.C. in 2021 than in the next closest state, Delaware.

Illinois, which is highest in total volume, also makes the list of states with the most mass shootings, injuries, and deaths per capita. The other top-volume states (California, Texas, Florida, and Pennsylvania) did not share that distinction in 2021.

The top five US cities with the most mass shooting incidents are Chicago, IL, Philadelphia, PA, Baltimore, MD, Houston, TX, and Saint Louis, MO.

The time series forecasts suggest that these four states will likely continue to see regular mass shooting incidents, and potentially more as time goes on.